10/16/2020

Mind the gap

What we will cover

  1. Syllabus and expectations

  2. Counting and how we will proceed

  3. The many faces of data

What do you think?

  • Pilots should love to know the defects in their planes. Knowing what is unreliable provides the information needed to make difficult decisions, like landing in bad weather.

  • Organizations that have good rainy day funds are more valuable than those that use their cash to buy back stock. When the rain arrives, and it always does, cash will keep jobs and businesses afloat.

  • Probability does not exist, neither do means and standard deviations.

  • Any distribution of anything that can be described only by means and standard deviations is (plausibly) the least informative.

  • Why do cooks eat their own food (or should)? If the cook fails, the cook suffers and may die, thus eliminating very dangerous people from the planet. (a saying of Taleb)

Progress

noun or verb? or both?

Do until (days left in the semester == 0) OR (you have completed your learning of a topic)

  1. Prepare for each week through reading, research, practice, and marching through the videos

  2. Live sessions to reinforce key topics and the solving of posed problems

  3. Questions and answers and follow-up on THE WALL for the next round

Magick?

Hammers

Getting to yes

modus ponens plausibility

major:      if A is true, then B is true
minor:      B is true
conclusion: thus, A becomes more plausible

Truth or consequences

modus ponens truth table

P      Q      if P then Q (major)   P (minor)   Q (conclusion)
TRUE   TRUE   TRUE                  TRUE        TRUE
TRUE   FALSE  FALSE                 TRUE        FALSE
FALSE  TRUE   TRUE                  FALSE       TRUE
FALSE  FALSE  TRUE                  FALSE       FALSE
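
The truth table can be regenerated mechanically. What follows is a minimal sketch in base R, not the course's own code; the object and column names (`truth`, `if_P_then_Q`, `modus_ponens_valid`) are illustrative assumptions. Material implication "if P then Q" is computed as `!P | Q`.

```r
# Enumerate every combination of truth values for P and Q.
truth <- expand.grid(P = c(TRUE, FALSE), Q = c(TRUE, FALSE))
# Order the rows as in the table: TT, TF, FT, FF.
truth <- truth[order(-truth$P, -truth$Q), ]
rownames(truth) <- NULL
# "if P then Q" is false only when P is true and Q is false.
truth$if_P_then_Q <- !truth$P | truth$Q
# Modus ponens is valid if Q holds in every row where both premises hold.
premises_hold <- truth$if_P_then_Q & truth$P
modus_ponens_valid <- all(truth$Q[premises_hold])
print(truth)
print(modus_ponens_valid)
```

Only the first row satisfies both premises, and Q is true there, which is why the argument form is valid.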

Plausible deniability

modus tollens plausibility

major:      if A is true, then B is true
minor:      A is false
conclusion: thus, B becomes less plausible
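
Both plausibility arguments can be checked numerically with Bayes' rule. The sketch below is only an illustration: the prior for A and the chance of B when A is false are assumed numbers, not values from the slides.

```r
# Assumed, purely illustrative inputs (not from the slides).
p_A            <- 0.3   # plausibility of A before seeing anything
p_B_given_A    <- 1.0   # major premise: if A is true, then B is true
p_B_given_notA <- 0.5   # B can still happen when A is false

# Plausibility of B before any observation (law of total probability).
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# First argument: B is observed, so update A with Bayes' rule.
p_A_given_B <- p_B_given_A * p_A / p_B

cat("P(A) =", p_A, "  P(A | B)     =", round(p_A_given_B, 3), "\n")
cat("P(B) =", p_B, "  P(B | not A) =", p_B_given_notA, "\n")
```

Under these assumptions, observing B raises the plausibility of A, and learning that A is false lowers the plausibility of B. Neither conclusion becomes certain.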

What’s the story?

  1. Concoct a data story

  2. Condition the story with data observations

  3. Critique the conditioned data story

Suppose this

  • We know there are positive and negative cases of a new virus in 4 zip codes.

  • Three data collectors, sampling at random and independently, observe a positive, then a negative, then another positive zip code. Their observations may or may not come from the same zip code.

  • We ask: how many of the 4 zip codes test positive?
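
One way to approach the question is to count, for each conjecture about the number of positive zip codes, how many ways the sequence positive, negative, positive could arise. This is a sketch in base R that assumes each collector lands on one of the 4 zip codes with equal chance, with repeats allowed; the variable names are illustrative.

```r
n_zips    <- 4
positives <- 0:n_zips              # five conjectures: 0, 1, 2, 3 or 4 positive zip codes
negatives <- n_zips - positives
# Observed sequence: positive, negative, positive.
ways <- positives * negatives * positives
# Plausibility of each conjecture: its ways divided by the total ways.
plausibility <- ways / sum(ways)
print(data.frame(positives, ways, plausibility))
```

Conjectures with no positive or no negative zip codes get zero ways, because they cannot produce both kinds of observation.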

Count until morale improves!

Yes, until something improves

In R this looks like …

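library(tidyverse)
library(rethinking)
n <- 1000          # number of grid points for the proportion p
n_success <- 6     # observed successes
n_trials  <- 8     # observed trials
(
  binomial_model <-
  tibble(p_grid = seq(from = 0, to = 1, length.out = n),
         # note we're still using a flat uniform prior
         prior  = 1) %>%
  # likelihood of the data at each candidate value of p
  mutate(likelihood = dbinom(n_success, size = n_trials, prob = p_grid)) %>%
  # normalize so the posterior sums to 1 across the grid
  mutate(posterior = (likelihood * prior) / sum(likelihood * prior))
)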
## # A tibble: 1,000 x 4
##     p_grid prior likelihood posterior
##      <dbl> <dbl>      <dbl>     <dbl>
##  1 0           1   0.        0.      
##  2 0.00100     1   2.81e-17  2.53e-19
##  3 0.00200     1   1.80e-15  1.62e-17
##  4 0.00300     1   2.04e-14  1.84e-16
##  5 0.00400     1   1.14e-13  1.03e-15
##  6 0.00501     1   4.36e-13  3.93e-15
##  7 0.00601     1   1.30e-12  1.17e-14
##  8 0.00701     1   3.27e-12  2.94e-14
##  9 0.00801     1   7.27e-12  6.55e-14
## 10 0.00901     1   1.47e-11  1.32e-13
## # ... with 990 more rows
summary( binomial_model )
##      p_grid         prior     likelihood         posterior        
##  Min.   :0.00   Min.   :1   Min.   :0.000000   Min.   :0.000e+00  
##  1st Qu.:0.25   1st Qu.:1   1st Qu.:0.003022   1st Qu.:2.722e-05  
##  Median :0.50   Median :1   Median :0.064970   Median :5.853e-04  
##  Mean   :0.50   Mean   :1   Mean   :0.111000   Mean   :1.000e-03  
##  3rd Qu.:0.75   3rd Qu.:1   3rd Qu.:0.220190   3rd Qu.:1.984e-03  
##  Max.   :1.00   Max.   :1   Max.   :0.311462   Max.   :2.806e-03
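
As a quick check on the grid, the posterior should sum to 1 and peak near the observed proportion 6/8. The sketch below recomputes the same grid in base R so that it runs on its own; `p_map_grid` is an illustrative name, not part of the original code.

```r
n         <- 1000
n_success <- 6
n_trials  <- 8
p_grid     <- seq(from = 0, to = 1, length.out = n)
prior      <- rep(1, n)                                  # flat prior, as in the tibble version
likelihood <- dbinom(n_success, size = n_trials, prob = p_grid)
posterior  <- likelihood * prior / sum(likelihood * prior)
# Grid value carrying the highest posterior mass.
p_map_grid <- p_grid[which.max(posterior)]
print(sum(posterior))
print(p_map_grid)
```

Because 0.75 is not exactly a grid point when `length.out = 1000`, the reported peak is the nearest grid value rather than 0.75 itself.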

A picture helps

library(tidybayes) # Mode() helper function
library(plotly) # make the plot interactive
# how many samples would you like
n_samples <- 10000 # 1e4
# make it reproducible
set.seed(42) # Hitchhiker's Guide
# draw grid values of p in proportion to their posterior mass
samples <-
  binomial_model %>%
  slice_sample(n = n_samples, weight_by = posterior, replace = TRUE)
#
p_MAP <- Mode(samples$p_grid) # Maximum A Posteriori point estimate
title <- "Bronx Zip Code Tests"
x_label <- "proportion of zip codes testing positive"
y_label <- "posterior density"
plt <- samples %>%
  ggplot(aes(x = p_grid)) +
  geom_density(fill = "blue", alpha = 0.3) +
  scale_x_continuous(x_label, limits = c(0, 1)) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_vline(xintercept = p_MAP, color = "orange", linewidth = 1.3) +
  annotate("text", x = 0.50, y = 2, label = paste0("MAP = ", round(p_MAP, 4))) +
  ylab(y_label) +
  ggtitle(title)
# ggplotly(plt) # uncomment to render the interactive plot

The plot thickens

The whole course is here

First

We set up a context, a story that has data associated with it:

  • 4 zip codes

    • positive and negative tests

Second

We collected 3 observations in various zip codes and conditioned our models on them

  • Conjectured 5 hypotheses (theories, models)

    • Conditioned the models with observations

    • Counted (finally admitting this!) the ways a model is consistent with data

    • Consistency asks: could the data have arisen if the model were true? (LOGIC!)

Third

We analyzed the ways and found the plausibility of each model; we then might have been forced by our employer to select the most plausible theory

  • Probability (or likelihood) is plausibility

    • Ways normalized by their sum

    • Only one theory is most probable: sample, sample, sample
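
"Sample, sample, sample" can be read literally: draw conjectures in proportion to their plausibility and see which one turns up most often. This is a base-R sketch for the 4 zip code story with the observed sequence positive, negative, positive; the number of draws and the variable names are assumptions made for illustration.

```r
set.seed(42)                                              # make it reproducible
positives    <- 0:4                                       # conjectured number of positive zip codes
ways         <- positives * (4 - positives) * positives   # positive, negative, positive
plausibility <- ways / sum(ways)                          # ways divided by their sum
# Draw conjectures in proportion to their plausibility.
draws <- sample(positives, size = 10000, replace = TRUE, prob = plausibility)
freq  <- table(factor(draws, levels = positives)) / length(draws)
print(round(freq, 3))
# The single most plausible conjecture.
most_plausible <- positives[which.max(plausibility)]
print(most_plausible)
```

The sampled frequencies approximate the plausibilities, so the most frequent draw points to the same conjecture as the exact count.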

The many faces of data

For example

  • Blue and red

  • Plausibility

  • Direction

  • Time

  • Troop strength

  • Temperature

  • Geo-coordinates

What’s a hierarchy?

A hierarchy is an analytical technique that takes a group of objects and asks two questions:

  1. How are the objects (nodes) related to one another (edges)? (just a network)

  2. Which objects are parents (higher level) or children (lower level) of one another?

A data hierarchy emerges

Load and inspect data

Summarise

## # A tibble: 6 x 4
##   CMPLNT_FR_DT PD_DESC                            lat   lon
##   <chr>        <chr>                            <dbl> <dbl>
## 1 12/10/2016   FRAUD,UNCLASSIFIED-FELONY         40.9 -73.9
## 2 12/3/2016    LARCENY,PETIT FROM AUTO           40.9 -73.9
## 3 11/16/2016   CRIMINAL MISCHIEF,UNCLASSIFIED 4  40.9 -73.8
## 4 10/27/2016   LARCENY,PETIT FROM OPEN AREAS,    40.9 -73.8
## 5 1/1/2016     LEAVING SCENE-ACCIDENT-PERSONA    40.8 -73.9
## 6 12/1/2015    FRAUD,UNCLASSIFIED-FELONY         40.9 -73.9
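
The complaint date already carries a small hierarchy: year, then month, then day. The sketch below retypes the six rows printed above into a base-R data frame (the original loading code is not shown, and `PD_DESC` is left out because it is truncated in the printout) and rolls the complaints up to the year and month levels. The names `complaints`, `by_year` and `by_year_month` are illustrative.

```r
complaints <- data.frame(
  CMPLNT_FR_DT = c("12/10/2016", "12/3/2016", "11/16/2016",
                   "10/27/2016", "1/1/2016", "12/1/2015"),
  lat = c(40.9, 40.9, 40.9, 40.9, 40.8, 40.9),
  lon = c(-73.9, -73.9, -73.8, -73.8, -73.9, -73.9),
  stringsAsFactors = FALSE
)
# Parse month/day/year strings into Date objects.
complaints$date  <- as.Date(complaints$CMPLNT_FR_DT, format = "%m/%d/%Y")
complaints$year  <- as.integer(format(complaints$date, "%Y"))
complaints$month <- as.integer(format(complaints$date, "%m"))
# Parent level: complaints per year.
by_year <- table(complaints$year)
# Child level: complaints per month within each year.
by_year_month <- table(complaints$year, complaints$month)
print(by_year)
print(by_year_month)
```

Each year is a parent node and each month within it a child, which is the parent and child relation described in the hierarchy slide.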

Scatter plot

Fierce?

What do we have to show for it?

  • A procedure of story, conditioning, choice

  • The many ways we can represent our physical reality with data, models, and plausibility

Next stop

  • Contingency tables

  • Logic and the Reverend Bayes

  • Check MOODLE for each week’s activities and requirements

  • In this fully remote course: every day is a class day!